## Introduction

Accurately projecting each player’s score is a central task in basketball analytics. Total points is a key measure of a player’s offensive ability and overall value to their club, and coaches, analysts, and fans use it to evaluate scoring ability and to inform game decisions, player recruitment, and scouting.

This notebook investigates several machine-learning approaches for forecasting basketball points. We concentrate on three regression models: the K-Nearest Neighbors (KNN) Regressor, the Decision Tree (DT) Regressor, and the Random Forest Regressor (RFR). Each model predicts a player’s total points from a variety of performance indicators such as minutes played, successful field goals, free throws, and so on.

Our goal is to compare the effectiveness of these models in forecasting basketball scores. The comparative analysis will highlight the advantages and disadvantages of each technique and point us to the most successful model for this dataset.

Join us as we delve into the fascinating world of basketball data analytics, assessing and comparing the predictions from each regression model to determine their effectiveness in projecting players’ scoring contributions.

This notebook contains the following tasks:

• Dataset overview: learn about the basketball dataset’s structure and properties.

• Import libraries: load the libraries required for data manipulation and visualization.

• Read the dataset and extract information: load the dataset and collect preliminary insights.

• Data visualization: use visualization to better understand the distribution of and relationships in the data.

• Feature selection: choose the features that will help predict basketball points.

## Model building:

• KNeighbors Regressor: For point prediction, use the KNN Regressor.

• Decision Tree Regressor: Use the DT Regressor to predict points.

• Random Forest Regressor: Use RFR to predict points.

• Predictions visualization: Visualize and assess the predictions provided by each regression model.

# Load libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(kknn)
## Warning: package 'kknn' was built under R version 4.2.3
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
library(rpart)
library(tree)
## Warning: package 'tree' was built under R version 4.2.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.2.3
# Suppress warnings
options(warn = -1)

The data frame df holds per-player basketball performance statistics. Its original column names are terse and not self-explanatory, so the code below renames them to be more meaningful and intelligible.

# Load the dataset
df <- read.csv('2023_nba_player_stats.csv')

# View the first few rows of the dataset
head(df, 3)
##          PName POS Team Age GP  W  L    Min  PTS FGM  FGA  FG. X3PM X3PA X3P.
## 1 Jayson Tatum  SF  BOS  25 74 52 22 2732.2 2225 727 1559 46.6  240  686 35.0
## 2  Joel Embiid   C  PHI  29 66 43 23 2284.1 2183 728 1328 54.8   66  200 33.0
## 3  Luka Doncic  PG  DAL  24 66 33 33 2390.5 2138 719 1449 49.6  185  541 34.2
##   FTM FTA  FT. OREB DREB REB AST TOV STL BLK  PF   FP DD2 TD3 X...
## 1 531 622 85.4   78  571 649 342 213  78  51 160 3691  31   1  470
## 2 661 771 85.7  113  557 670 274 226  66 112 205 3706  39   1  424
## 3 515 694 74.2   54  515 569 529 236  90  33 166 3747  36  10  128
# Check the dimensions of the dataset
dim(df)
## [1] 539  30
# Check for duplicate rows
sum(duplicated(df))
## [1] 0
# Rename columns
names(df) <- c('Player_Name', 'Position', 'Team_Abbreviation', 'Age', 'Games_Played', 'Wins', 'Losses', 'Minutes_Played', 'Total_Points', 'Field_Goals_Made', 'Field_Goals_Attempted', 'Field_Goal_Percentage', 'Three_Point_FG_Made', 'Three_Point_FG_Attempted', 'Three_Point_FG_Percentage', 'Free_Throws_Made', 'Free_Throws_Attempted', 'Free_Throw_Percentage', 'Offensive_Rebounds', 'Defensive_Rebounds', 'Total_Rebounds', 'Assists', 'Turnovers', 'Steals', 'Blocks', 'Personal_Fouls', 'NBA_Fantasy_Points', 'Double_Doubles', 'Triple_Doubles', 'Plus_Minus')

# Display structure of the dataset
str(df)
## 'data.frame':    539 obs. of  30 variables:
##  $ Player_Name              : chr  "Jayson Tatum" "Joel Embiid" "Luka Doncic" "Shai Gilgeous-Alexander" ...
##  $ Position                 : chr  "SF" "C" "PG" "PG" ...
##  $ Team_Abbreviation        : chr  "BOS" "PHI" "DAL" "OKC" ...
##  $ Age                      : int  25 29 24 24 28 21 28 26 24 28 ...
##  $ Games_Played             : int  74 66 66 68 63 79 77 68 73 77 ...
##  $ Wins                     : int  52 43 33 33 47 40 44 44 38 38 ...
##  $ Losses                   : int  22 23 33 35 16 39 33 24 35 39 ...
##  $ Minutes_Played           : num  2732 2284 2390 2416 2024 ...
##  $ Total_Points             : int  2225 2183 2138 2135 1959 1946 1936 1922 1914 1913 ...
##  $ Field_Goals_Made         : int  727 728 719 704 707 707 658 679 597 673 ...
##  $ Field_Goals_Attempted    : int  1559 1328 1449 1381 1278 1541 1432 1402 1390 1388 ...
##  $ Field_Goal_Percentage    : num  46.6 54.8 49.6 51 55.3 45.9 45.9 48.4 42.9 48.5 ...
##  $ Three_Point_FG_Made      : int  240 66 185 58 47 213 218 245 154 204 ...
##  $ Three_Point_FG_Attempted : int  686 200 541 168 171 578 636 635 460 544 ...
##  $ Three_Point_FG_Percentage: num  35 33 34.2 34.5 27.5 36.9 34.3 38.6 33.5 37.5 ...
##  $ Free_Throws_Made         : int  531 661 515 669 498 319 402 319 566 363 ...
##  $ Free_Throws_Attempted    : int  622 771 694 739 772 422 531 368 639 428 ...
##  $ Free_Throw_Percentage    : num  85.4 85.7 74.2 90.5 64.5 75.6 75.7 86.7 88.6 84.8 ...
##  $ Offensive_Rebounds       : int  78 113 54 59 137 47 141 63 56 42 ...
##  $ Defensive_Rebounds       : int  571 557 515 270 605 411 626 226 161 303 ...
##  $ Total_Rebounds           : int  649 670 569 329 742 458 767 289 217 345 ...
##  $ Assists                  : int  342 274 529 371 359 350 316 301 741 327 ...
##  $ Turnovers                : int  213 226 236 192 246 259 216 180 300 194 ...
##  $ Steals                   : int  78 66 90 112 52 125 49 99 80 69 ...
##  $ Blocks                   : int  51 112 33 65 51 58 21 27 9 18 ...
##  $ Personal_Fouls           : int  160 205 166 192 197 186 233 168 104 159 ...
##  $ NBA_Fantasy_Points       : int  3691 3706 3747 3425 3451 3311 3324 2918 3253 2885 ...
##  $ Double_Doubles           : int  31 39 36 3 46 9 40 5 40 2 ...
##  $ Triple_Doubles           : int  1 1 10 0 6 0 0 0 0 0 ...
##  $ Plus_Minus               : int  470 424 128 149 341 97 170 338 100 18 ...
# Descriptive statistics for numeric variables
summary(df[sapply(df, is.numeric)])
##       Age         Games_Played        Wins           Losses     
##  Min.   :19.00   Min.   : 1.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:23.00   1st Qu.:30.50   1st Qu.:12.00   1st Qu.:14.00  
##  Median :25.00   Median :54.00   Median :25.00   Median :25.00  
##  Mean   :25.97   Mean   :48.04   Mean   :24.02   Mean   :24.02  
##  3rd Qu.:29.00   3rd Qu.:68.00   3rd Qu.:36.00   3rd Qu.:34.00  
##  Max.   :42.00   Max.   :83.00   Max.   :57.00   Max.   :60.00  
##  Minutes_Played    Total_Points    Field_Goals_Made Field_Goals_Attempted
##  Min.   :   1.0   Min.   :   0.0   Min.   :  0.0    Min.   :   0.0       
##  1st Qu.: 329.0   1st Qu.: 120.5   1st Qu.: 45.5    1st Qu.:  93.5       
##  Median : 970.2   Median : 374.0   Median :138.0    Median : 300.0       
##  Mean   :1103.6   Mean   : 523.4   Mean   :191.6    Mean   : 403.0       
##  3rd Qu.:1845.9   3rd Qu.: 769.5   3rd Qu.:283.5    3rd Qu.: 598.5       
##  Max.   :2963.2   Max.   :2225.0   Max.   :728.0    Max.   :1559.0       
##  Field_Goal_Percentage Three_Point_FG_Made Three_Point_FG_Attempted
##  Min.   :  0.00        Min.   :  0.00      Min.   :  0.0           
##  1st Qu.: 41.65        1st Qu.:  5.00      1st Qu.: 17.0           
##  Median : 45.50        Median : 36.00      Median :109.0           
##  Mean   : 46.33        Mean   : 56.32      Mean   :156.1           
##  3rd Qu.: 50.60        3rd Qu.: 92.00      3rd Qu.:249.5           
##  Max.   :100.00        Max.   :301.00      Max.   :731.0           
##  Three_Point_FG_Percentage Free_Throws_Made Free_Throws_Attempted
##  Min.   :  0.00            Min.   :  0.00   Min.   :  0.0        
##  1st Qu.: 28.10            1st Qu.: 13.50   1st Qu.: 18.0        
##  Median : 34.20            Median : 42.00   Median : 60.0        
##  Mean   : 31.53            Mean   : 83.95   Mean   :107.4        
##  3rd Qu.: 38.50            3rd Qu.:113.50   3rd Qu.:147.0        
##  Max.   :100.00            Max.   :669.00   Max.   :772.0        
##  Free_Throw_Percentage Offensive_Rebounds Defensive_Rebounds Total_Rebounds 
##  Min.   :  0.00        Min.   :  0.00     Min.   :  0.0      Min.   :  0.0  
##  1st Qu.: 66.70        1st Qu.: 10.00     1st Qu.: 36.5      1st Qu.: 50.5  
##  Median : 76.30        Median : 33.00     Median :118.0      Median :159.0  
##  Mean   : 71.99        Mean   : 47.62     Mean   :150.6      Mean   :198.3  
##  3rd Qu.: 84.10        3rd Qu.: 63.00     3rd Qu.:229.5      3rd Qu.:286.0  
##  Max.   :100.00        Max.   :274.00     Max.   :744.0      Max.   :973.0  
##     Assists        Turnovers         Steals           Blocks      
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 22.0   1st Qu.: 14.5   1st Qu.:  8.50   1st Qu.:  5.00  
##  Median : 69.0   Median : 44.0   Median : 28.00   Median : 13.00  
##  Mean   :115.5   Mean   : 61.3   Mean   : 33.27   Mean   : 21.24  
##  3rd Qu.:162.5   3rd Qu.: 92.5   3rd Qu.: 51.00   3rd Qu.: 28.00  
##  Max.   :741.0   Max.   :300.0   Max.   :128.00   Max.   :193.00  
##  Personal_Fouls   NBA_Fantasy_Points Double_Doubles   Triple_Doubles   
##  Min.   :  0.00   Min.   :  -1       Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 32.00   1st Qu.: 254       1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 86.00   Median : 810       Median : 0.000   Median : 0.0000  
##  Mean   : 91.18   Mean   :1037       Mean   : 4.011   Mean   : 0.2208  
##  3rd Qu.:140.00   3rd Qu.:1646       3rd Qu.: 3.000   3rd Qu.: 0.0000  
##  Max.   :279.00   Max.   :3842       Max.   :65.000   Max.   :29.0000  
##    Plus_Minus  
##  Min.   :-642  
##  1st Qu.: -70  
##  Median :  -7  
##  Mean   :   0  
##  3rd Qu.:  57  
##  Max.   : 640
# Descriptive statistics for categorical variables
summary(df[sapply(df, is.character)])
##  Player_Name          Position         Team_Abbreviation 
##  Length:539         Length:539         Length:539        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Histogram of player positions: this histogram shows the distribution of players across the different positions.

Bar chart of average points per position: this chart visualizes the average total points scored by players at each position. - Players at the PG position have the highest average total points, followed by SG, SF, PF, and the others.

# Checking for missing values
colSums(is.na(df))
##               Player_Name                  Position         Team_Abbreviation 
##                         0                         0                         0 
##                       Age              Games_Played                      Wins 
##                         0                         0                         0 
##                    Losses            Minutes_Played              Total_Points 
##                         0                         0                         0 
##          Field_Goals_Made     Field_Goals_Attempted     Field_Goal_Percentage 
##                         0                         0                         0 
##       Three_Point_FG_Made  Three_Point_FG_Attempted Three_Point_FG_Percentage 
##                         0                         0                         0 
##          Free_Throws_Made     Free_Throws_Attempted     Free_Throw_Percentage 
##                         0                         0                         0 
##        Offensive_Rebounds        Defensive_Rebounds            Total_Rebounds 
##                         0                         0                         0 
##                   Assists                 Turnovers                    Steals 
##                         0                         0                         0 
##                    Blocks            Personal_Fouls        NBA_Fantasy_Points 
##                         0                         0                         0 
##            Double_Doubles            Triple_Doubles                Plus_Minus 
##                         0                         0                         0
# Handle missing values defensively (the check above found none); fill any NA in 'Position' with 'SG'
df$Position[is.na(df$Position)] <- 'SG'

# Histogram of 'Position' using ggplot2
library(ggplot2)
ggplot(df, aes(x = Position)) +
  geom_bar(fill = "blue") +  # geom_bar counts categories; geom_histogram(stat = "count") is deprecated usage
  theme_minimal() +
  labs(title = 'Players Position Value Counts', x = 'Position', y = 'Count')

# Alternatively, using plotly
library(plotly)
fig <- plot_ly(df, x = ~Position, type = "histogram")
fig
# Average points per position using ggplot2
position_stats <- df %>%
  group_by(Position) %>%
  summarize(Average_Total_Points = mean(Total_Points, na.rm = TRUE))

ggplot(position_stats, aes(x = Position, y = Average_Total_Points, fill = Position)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = 'Average Points per Position', x = 'Position', y = 'Average Total Points')

# Alternatively, using plotly
fig <- plot_ly(position_stats, x = ~Position, y = ~Average_Total_Points, type = "bar")
fig
  1. Histogram of player ages: the plot indicates how many players fall into each age group. Because the bin width is set to one, each bar represents a one-year age interval. This visualization aids in understanding the age demographics of the players.
  2. Scatter plots:

Age vs. total points: each point represents a player, with the X-axis showing age and the Y-axis total points. Color coding by position allows differentiation between positions, potentially revealing patterns or discrepancies in scoring.

Age vs. field goal percentage: this graph depicts the association between player age and field goal percentage, which is useful for checking whether age is linked to shooting effectiveness. - Field goal output is highest between ages 23 and 25 and lowest from age 31 onward.

Age vs. assists: this graph shows how player age correlates with the number of assists. - Although the scatter plot does not show much variation in assists across age groups, it does show which positions tend to assist more, and we can observe a high number of assists from the PG position.

  3. Bar charts: average fantasy points by position: this bar chart shows the average fantasy points scored by players at each position, allowing a comparison across positions. - We can observe that players at the PG position have the highest average fantasy points, and players at the G position the lowest.

# Histogram of Player Ages
ggplot(df, aes(x = Age)) +
  geom_histogram(binwidth = 1, fill = "Blue") +
  labs(title = "Distribution of Player Ages", x = "Age", y = "Count")

# Scatter Plots
# Age vs. Total Points
ggplot(df, aes(x = Age, y = Total_Points, color = Position)) +
  geom_point() +
  labs(title = "Player Age vs Total Points", x = "Age", y = "Total Points")

# Age vs. Field Goal Percentage
ggplot(df, aes(x = Age, y = Field_Goal_Percentage, color = Position)) +
  geom_point() +
  labs(title = "Player Age vs Field Goal Percentage", x = "Age", y = "Field Goal Percentage")

# Age vs. Assists
ggplot(df, aes(x = Age, y = Assists, color = Position)) +
  geom_point() +
  labs(title = "Player Age vs Assists", x = "Age", y = "Assists")

# Bar Charts
# Average Fantasy Points by Position
avg_fantasy_points <- df %>%
  group_by(Position) %>%
  summarize(Avg_Fantasy_Points = mean(NBA_Fantasy_Points, na.rm = TRUE))
ggplot(avg_fantasy_points, aes(x = Position, y = Avg_Fantasy_Points, fill = Position)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Fantasy Points by Position", x = "Position", y = "Average Fantasy Points")

# Double and Triple Doubles by Position
double_doubles_by_position <- df %>%
  group_by(Position) %>%
  summarize(Double_Doubles = sum(Double_Doubles, na.rm = TRUE))

triple_doubles_by_position <- df %>%
  group_by(Position) %>%
  summarize(Triple_Doubles = sum(Triple_Doubles, na.rm = TRUE))
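The two position summaries above are computed but not yet plotted; a minimal sketch of the corresponding bar chart (assuming `double_doubles_by_position` from the chunk above):

```r
# Sketch: bar chart of total double-doubles by position (assumes double_doubles_by_position exists)
ggplot(double_doubles_by_position,
       aes(x = Position, y = Double_Doubles, fill = Position)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Double-Doubles by Position", x = "Position", y = "Double-Doubles")
```

The same pattern applies to `triple_doubles_by_position`, swapping the y aesthetic.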

## Boxplots

These box plots can identify patterns and abnormalities in a variety of player performance measures. For example, a box plot for ‘Total Points’ may show the distribution of points scored by players throughout different games or seasons, highlighting the average range of scores as well as any extraordinary performances.

This visualization method is very useful in exploratory data analysis since it provides insights into the nature of the data that can inform subsequent studies or model-building.

These box plots are used to visualize and compare the distributions of various basketball-related statistics, assisting in the understanding of the data structure, identifying any data quality issues, and gaining insights into player performances.

library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
# Prepare data (excluding identifier columns)
columns_to_exclude <- c('Player_Name', 'Position', 'Team_Abbreviation')
columns <- setdiff(names(df), columns_to_exclude)

# Draw a box plot for each remaining numeric column
# (.data[[...]] replaces the deprecated aes_string())
for (i in seq_along(columns)) {
    p <- ggplot(df, aes(y = .data[[columns[i]]])) +
         geom_boxplot() +
         theme_minimal() +
         ggtitle(paste("Box Plot of", columns[i]))
    print(p)  # Display each plot individually
}
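Since gridExtra is loaded above, the individual plots can also be arranged together; a minimal sketch, assuming `df` and `columns` from the loop above:

```r
# Sketch: combine the first four box plots into a 2x2 grid (assumes df and columns exist)
plots <- lapply(columns[1:4], function(col) {
  ggplot(df, aes(y = .data[[col]])) +
    geom_boxplot() +
    theme_minimal() +
    ggtitle(paste("Box Plot of", col))
})
gridExtra::grid.arrange(grobs = plots, ncol = 2)
```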

In this code portion the dataset is preprocessed and split into training and testing sets. The non-predictive Player_Name column is dropped, and caret’s createDataPartition is used to divide the data, with 70% of the rows in the training set and the remaining 30% in the testing set. The seed is set to 5555 for reproducibility.

# Assuming df is your dataframe and 'Total_Points' is the target variable
set.seed(5555)  # for reproducibility

# Remove the Player_Name identifier column (not a predictor)
updatedDf <- df[, !grepl("^Player_Name", names(df))]


trainIndex <- createDataPartition(updatedDf$Total_Points, p = .7, list = FALSE)
dataTrain <- updatedDf[ trainIndex,]
dataTest  <- updatedDf[-trainIndex,]
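To confirm the split, the dimensions of the two subsets can be inspected; a quick sanity check, assuming `dataTrain`, `dataTest`, and `updatedDf` from above:

```r
# Inspect the sizes of the training and testing sets
dim(dataTrain)  # rows x columns of the ~70% training split
dim(dataTest)   # rows x columns of the ~30% testing split

# The two splits together should cover the full dataset
nrow(dataTrain) + nrow(dataTest) == nrow(updatedDf)
```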

## Random Forest

In this part a Random Forest regressor is trained to predict Total_Points. The randomForest() call below grows a small forest (ntree = 5) for speed, and performance is then assessed on the held-out test set with the mean squared error (MSE), the R² score, and the mean absolute error (MAE). The R² score quantifies the fraction of the variance in the target that is explained by the predictions, so values close to 1 indicate a good fit; hyperparameters such as the number of trees can be tuned to improve it further.

library(randomForest)
library(caret)
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
# Train the Random Forest model
rf_model <- randomForest(Total_Points ~ ., data = dataTrain, ntree = 5)

# Make predictions
rf_predictions <- predict(rf_model, dataTest)

# Evaluate the model
rf_mse <- mse(dataTest$Total_Points, rf_predictions)
rf_r2 <- R2(dataTest$Total_Points, rf_predictions)
rf_mae <- mae(dataTest$Total_Points, rf_predictions)



print(paste("MSE:", rf_mse))
## [1] "MSE: 2853.47921756944"
print(paste("R2 score:", rf_r2))
## [1] "R2 score: 0.987580929929807"
print(paste("mae score:", rf_mae))
## [1] "mae score: 32.8888125"
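A forest of only 5 trees already fits well; as a sketch of hyperparameter tuning (assuming `dataTrain` and `dataTest` from the split above), the number of trees can be varied and the test-set R² compared:

```r
# Sketch: compare test-set R^2 for several forest sizes (assumes dataTrain/dataTest exist)
for (n in c(5, 50, 100, 500)) {
  m <- randomForest(Total_Points ~ ., data = dataTrain, ntree = n)
  preds <- predict(m, dataTest)
  cat("ntree =", n, " R2 =", R2(dataTest$Total_Points, preds), "\n")
}
```

Larger forests are usually more stable, at the cost of training time.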

## KNN Model

Training and evaluation of the K-Nearest Neighbors (KNN) regressor model.

In this code section a KNN regressor from the kknn package is trained and assessed on the same 70/30 split used above, predicting players’ Total_Points on the test set. KNN predicts each test observation from its k closest neighbours in feature space; here k = 1, so each prediction is simply the value of the single nearest training player. Because distances depend on feature scale, kknn standardizes the predictors internally by default.

The model is evaluated with the same metrics as the Random Forest: the MSE, the R² score (which quantifies the fraction of the variance in the dependent variable that is predictable from the independent variables), and the MAE, allowing a direct comparison between the models.

# Library for the KNN model
library(kknn)

# Inspect the test set before fitting
head(dataTest)
##    Position Team_Abbreviation Age Games_Played Wins Losses Minutes_Played
## 1        SF               BOS  25           74   52     22         2732.2
## 9        PG               ATL  24           73   38     35         2540.7
## 10       SG               CHI  28           77   38     39         2767.9
## 16       PF               UTA  25           66   32     34         2272.5
## 18       SG               HOU  21           76   20     56         2602.2
## 30       SG               GSW  33           69   38     31         2278.9
##    Total_Points Field_Goals_Made Field_Goals_Attempted Field_Goal_Percentage
## 1          2225              727                  1559                  46.6
## 9          1914              597                  1390                  42.9
## 10         1913              673                  1388                  48.5
## 16         1691              571                  1144                  49.9
## 18         1683              566                  1359                  41.6
## 30         1509              546                  1252                  43.6
##    Three_Point_FG_Made Three_Point_FG_Attempted Three_Point_FG_Percentage
## 1                  240                      686                      35.0
## 9                  154                      460                      33.5
## 10                 204                      544                      37.5
## 16                 200                      510                      39.2
## 18                 187                      554                      33.8
## 30                 301                      731                      41.2
##    Free_Throws_Made Free_Throws_Attempted Free_Throw_Percentage
## 1               531                   622                  85.4
## 9               566                   639                  88.6
## 10              363                   428                  84.8
## 16              349                   399                  87.5
## 18              364                   463                  78.6
## 30              116                   132                  87.9
##    Offensive_Rebounds Defensive_Rebounds Total_Rebounds Assists Turnovers
## 1                  78                571            649     342       213
## 9                  56                161            217     741       300
## 10                 42                303            345     327       194
## 16                130                440            570     123       127
## 18                 43                241            284     281       200
## 30                 39                247            286     163       123
##    Steals Blocks Personal_Fouls NBA_Fantasy_Points Double_Doubles
## 1      78     51            160               3691             31
## 9      80      9            104               3253             40
## 10     69     18            159               2885              2
## 16     42     38            137               2673             28
## 18     59     18            131               2476              0
## 30     49     29            130               2208              2
##    Triple_Doubles Plus_Minus
## 1               1        470
## 9               0        100
## 10              0         18
## 16              0        163
## 18              0       -447
## 30              0        163
# Fit the KNN model; kknn scales the predictors by default, and k = 1 uses the single nearest neighbour
knn_model <- kknn(Total_Points ~ ., train = dataTrain, test = dataTest, k = 1)

# Make predictions
knn_predictions <- fitted(knn_model)

# Evaluating the model
knn_mse <- mse(dataTest$Total_Points, knn_predictions)
knn_r2 <- R2(dataTest$Total_Points, knn_predictions)
knn_mae <- mae(dataTest$Total_Points, knn_predictions)


print(paste("KNN - MSE:", knn_mse))
## [1] "KNN - MSE: 17012.7125"
print(paste("KNN - R2 score:", knn_r2))
## [1] "KNN - R2 score: 0.927435691388151"
print(paste("KNN - MAE score:", knn_mae))
## [1] "KNN - MAE score: 94.3625"
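k = 1 makes each prediction depend on a single neighbour and is therefore noisy; as a sketch (assuming `dataTrain` from above), kknn’s train.kknn can select k by leave-one-out cross-validation:

```r
# Sketch: choose k by leave-one-out CV on the training set (assumes dataTrain exists)
cv_fit <- train.kknn(Total_Points ~ ., data = dataTrain, kmax = 15)
cv_fit$best.parameters  # the k (and kernel) with the lowest CV error
```

Refitting kknn with the selected k would typically reduce the test error relative to k = 1.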

## Decision Tree Regressor

In this code section a Decision Tree regressor is fitted with the rpart package (method = "anova" for regression) using its default hyperparameters, and predictions are generated on the held-out test set. The model is evaluated with the same metrics as before (MSE, R², and MAE) so the three models can be compared directly. Decision trees are easy to interpret but prone to overfitting; their size can be controlled through hyperparameters such as the complexity parameter cp.

library(rpart)
# Train the Decision Tree model
dt_model <- rpart(Total_Points ~ ., data = dataTrain, method = "anova")

# Make predictions
dt_predictions <- predict(dt_model, dataTest, type = "vector")

# Evaluate the model
dt_mse <- mse(dataTest$Total_Points, dt_predictions)
dt_r2 <- R2(dataTest$Total_Points, dt_predictions)
dt_mae <- mae(dataTest$Total_Points, dt_predictions)

print(paste("Decision Tree - MSE:", dt_mse))
## [1] "Decision Tree - MSE: 8689.63837496918"
print(paste("Decision Tree - R2 score:", dt_r2))
## [1] "Decision Tree - R2 score: 0.962413315013407"
print(paste("Decision Tree - MAE score:", dt_mae))
## [1] "Decision Tree - MAE score: 70.0075546120961"
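rpart grows the tree with a default complexity parameter; a sketch of pruning it back (assuming `dt_model` from above) uses the cross-validation table stored in the fit:

```r
# Sketch: prune the tree at the cp value with the lowest cross-validated error (assumes dt_model exists)
printcp(dt_model)  # cross-validation results per cp value
best_cp <- dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
pruned <- prune(dt_model, cp = best_cp)
rpart.plot::rpart.plot(pruned)  # visualize the pruned tree (rpart.plot is loaded above)
```

Pruning trades a little training accuracy for better generalization.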

## Model Evaluation

Visual comparisons between predicted and actual points are generated in this section using several types of graphs.

Scatter plot: a scatter plot compares the actual points (x-axis) with the predicted points (y-axis), with each point color-coded by its actual value. The plot is built with ggplot2 and converted to an interactive figure with plotly’s ggplotly().

Residual plot: a residual plot shows the disparities between the actual and predicted points. The residuals are computed as actual minus predicted, and a dashed orange line at y = 0 helps visualize the divergence from a perfect fit.

Predicted vs. true line plot: this plot compares the true values (x-axis) with the predicted values (y-axis), showing the scatter of predictions together with an ideal line and a regression line. The regression line represents the linear relationship between the true and predicted values. Each figure is made interactive via ggplotly().

The three plots (the actual vs. predicted scatter plot, the residual plot, and the predicted vs. true line plot) are standard tools in regression analysis for evaluating predictive model performance. Each visualization provides a unique perspective on the accuracy and characteristics of the model’s predictions.

library(randomForest)
library(ggplot2)
library(plotly)

# Make predictions using the Random Forest model
rf_predictions <- predict(rf_model, dataTest)

# Create a dataframe for comparison
comparison_df <- data.frame(Actual = dataTest$Total_Points, Predicted = rf_predictions)

# Calculate residuals
comparison_df$Residuals <- comparison_df$Actual - comparison_df$Predicted


fig_scatter <- ggplot(comparison_df, aes(x = Actual, y = Predicted, color = Actual)) +
  geom_point() +
  ggtitle("Comparison of Actual vs. Predicted") +
  labs(x = "Actual Points", y = "Predicted Points") +
  theme_minimal()

ggplotly(fig_scatter) # Convert to interactive plotly plot
fig_residual <- ggplot(comparison_df, aes(x = Predicted, y = Residuals)) +
  geom_point(color = "orangered") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "orange") +
  ggtitle("Residual Plot") +
  labs(x = "Predicted Values", y = "Residuals") +
  theme_minimal()

ggplotly(fig_residual) # Convert to interactive plotly plot
fig_line <- ggplot(comparison_df, aes(x = Actual, y = Predicted)) +
  geom_point(color = "Green") +
  geom_line(aes(y = Actual), color = "#98DFD6") +
  geom_smooth(method = lm, color = "#FFDD83", se = FALSE) +
  ggtitle("Predicted vs. True Line Plot") +
  labs(x = "True Values", y = "Predicted Values") +
  theme_minimal()

ggplotly(fig_line) # Convert to interactive plotly plot
## `geom_smooth()` using formula = 'y ~ x'

## Conclusion

Based on the development and evaluation of the Decision Tree, K-Nearest Neighbors (KNN), and Random Forest regression models for predicting basketball points in this NBA dataset, the following conclusions can be drawn:

Model Performance and Accuracy:

The Random Forest Regressor was the most accurate of the three models (Decision Tree, KNN, and Random Forest). This suggests that Random Forest’s ensemble technique, which combines numerous decision trees to produce more robust and generalized predictions, is well suited to this type of data. While the Decision Tree model is simpler and easier to interpret, it did not capture the complexity of the data as well as the Random Forest; overfitting is a common problem for decision trees, especially on complicated and diverse datasets like those used in sports analytics. The KNN model, which makes predictions based on distance measurements, performed worst here. This could be due to the high dimensionality of the data or the need for careful feature scaling and parameter selection (with k = 1, each prediction relies on a single neighbour and is therefore noisy). KNN models are highly sensitive to data scale, and imbalances or outliers can have a major impact on their performance.

Insights from Model Evaluation:

The evaluation metrics used to analyze the models (MSE, R², and MAE) provided a full picture of their predictive capabilities. Lower MSE and MAE values, together with a higher R² value, demonstrate the Random Forest model’s superior ability to predict the total points scored by NBA players. Such metrics are essential not only for determining model accuracy but also for understanding the kinds of errors each model makes; this knowledge can guide future model development and optimization.
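The metrics discussed above can be collected into a single comparison table; a minimal sketch, assuming the `*_mse`, `*_r2`, and `*_mae` objects computed in the earlier sections:

```r
# Sketch: side-by-side comparison of the three models (assumes the metric objects from earlier sections)
results <- data.frame(
  Model = c("Random Forest", "KNN", "Decision Tree"),
  MSE   = c(rf_mse, knn_mse, dt_mse),
  R2    = c(rf_r2, knn_r2, dt_r2),
  MAE   = c(rf_mae, knn_mae, dt_mae)
)
results[order(results$MSE), ]  # best (lowest-error) model first
```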